{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9K302jR6r3yU",
   "metadata": {
    "id": "9K302jR6r3yU"
   },
   "outputs": [],
   "source": [
    "# Copyright 2024 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ae379f7",
   "metadata": {},
   "source": [
    "<table align=\"left\">\n",
    "  <td style=\"text-align: center\">\n",
    "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/vertex_evaluation_services/gemini-curate-evaluation-data/curate_new_evals.ipynb\">\n",
    "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
    "    </a>\n",
    "  </td>\n",
    "  <td style=\"text-align: center\">\n",
    "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fgenai-on-vertex-ai/vertex_evaluation_services/gemini-curate-evaluation-data/curate_new_evals.ipynb\">\n",
    "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
    "    </a>\n",
    "  </td>\n",
    "  <td style=\"text-align: center\">\n",
    "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/applied-ai-engineering-samples/main/genai-on-vertex-ai/vertex_evaluation_services/gemini-curate-evaluation-data/curate_new_evals.ipynb\">\n",
    "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
    "    </a>\n",
    "  </td>\n",
    "  <td style=\"text-align: center\">\n",
    "    <a href=\"https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/vertex_evaluation_services/gemini-curate-evaluation-data/curate_new_evals.ipynb\">\n",
    "      <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
    "    </a>\n",
    "  </td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cfc1dcf",
   "metadata": {},
   "source": [
    "| | |\n",
    "|-|-|\n",
    "| Author(s) | [Ken Lee](https://github.com/kenleejr) |\n",
    "| Reviewer(s) | [Abhishek Bhagwat](https://github.com/Abhishekbhagwat) |\n",
    "| Last updated | 2024-10-07 |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "HFXYv9sOVkYq",
   "metadata": {
    "id": "HFXYv9sOVkYq"
   },
   "source": [
    "Deploying a chatbot or retrieval-augmented generation (RAG) application to real users produces a wealth of valuable data. User queries reveal what users need, which products they engage with, and how effective the chatbot itself is. This data is crucial both for understanding your users and for continuously evaluating the performance of your deployed system.\n",
    "\n",
    "![Cluster diagram](./cluster_diagram.png)\n",
    "\n",
    "This notebook demonstrates how to use Gemini to accelerate the analysis and summarization of real user queries from a production RAG system or chatbot. By analyzing these queries, we can identify a representative set of questions to form an evaluation dataset, establishing a foundation for continuous evaluation.\n",
    "\n",
    "This process aims to answer the following questions:\n",
    "\n",
    "- What general categories of questions are users asking? What problems are they encountering?\n",
    "- What topics are prevalent in user conversations?\n",
    "- What sentiments are users expressing?\n",
    "\n",
    "This notebook is inspired by a [Weights and Biases article](https://wandb.ai/wandbot/wandbot-eval/reports/How-to-Evaluate-an-LLM-Part-1-Building-an-Evaluation-Dataset-for-our-LLM-System--Vmlldzo1NTAwNTcy) and extends its concepts by using Gemini. Gemini's large context window allows for rapid exploratory data analysis (EDA) of clustered questions, even with large datasets. This makes it efficient to extract metadata from the clusters and to select a robust, representative evaluation set for the RAG system."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0557777",
   "metadata": {},
   "source": [
    "# 🎬 Getting Started\n",
    "\n",
    "The following steps are necessary to run this notebook, no matter what notebook environment you're using.\n",
    "\n",
    "If you're entirely new to Google Cloud, [get started here](https://cloud.google.com/docs/get-started).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3eabeeb",
   "metadata": {},
   "source": [
    "### Google Cloud Permissions\n",
    "\n",
    "**To run the complete notebook, including the optional section, you need the [Owner role](https://cloud.google.com/iam/docs/understanding-roles) for your project.**\n",
    "\n",
    "If you want to skip the optional section, you need at least the following [roles](https://cloud.google.com/iam/docs/granting-changing-revoking-access):\n",
    "* **`roles/serviceusage.serviceUsageAdmin`** to enable APIs\n",
    "* **`roles/iam.serviceAccountAdmin`** to modify service agent permissions\n",
    "* **`roles/aiplatform.user`** to use Vertex AI components"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9c9284b",
   "metadata": {},
   "source": [
    "### Install Vertex AI SDK and Other Required Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "TMrCwE6KVsW3",
   "metadata": {
    "collapsed": true,
    "id": "TMrCwE6KVsW3",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "!pip install -qqq llama-index \\\n",
    "llama-index-llms-vertex \\\n",
    "llama-index-embeddings-vertex \\\n",
    "python-louvain \\\n",
    "tiktoken \\\n",
    "aiofiles \\\n",
    "annotated-types \\\n",
    "python-fasthtml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5447b19b",
   "metadata": {},
   "source": [
    "### Restart Runtime\n",
    "\n",
    "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5e675c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Restart kernel after installs so that your environment can access the new packages\n",
    "import IPython\n",
    "\n",
    "app = IPython.Application.instance()\n",
    "app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f67891c",
   "metadata": {},
   "source": [
    "### Authenticate\n",
    "\n",
    "If you're using Colab, run the code in the next cell. Follow the pop-ups and authenticate with an account that has access to your Google Cloud [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).\n",
    "\n",
    "If you're running this notebook outside Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into [Application Default Credentials for your local environment](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev) and [initializing the Google Cloud CLI](https://cloud.google.com/docs/authentication/gcloud). In many cases, running `gcloud auth application-default login` in a shell on the machine running the notebook kernel is sufficient.\n",
    "\n",
    "For more authentication options, see the [Google Cloud authentication documentation](https://cloud.google.com/docs/authentication)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "039a18d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Colab authentication.\n",
    "import sys\n",
    "\n",
    "if \"google.colab\" in sys.modules:\n",
    "    from google.colab import auth\n",
    "    auth.authenticate_user()\n",
    "    print('Authenticated')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "DGTF2qUYVp0x",
   "metadata": {
    "id": "DGTF2qUYVp0x"
   },
   "source": [
    "### Set Google Cloud project information and Initialize Vertex AI SDK\n",
    "\n",
    "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n",
    "\n",
    "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).\n",
    "\n",
    "Make sure to change `PROJECT_ID` in the next cell. You can leave `REGION` at its default value unless you have a specific reason to change it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ODApYPsB6nNR",
   "metadata": {
    "id": "ODApYPsB6nNR"
   },
   "outputs": [],
   "source": [
    "import vertexai\n",
    "\n",
    "PROJECT_ID = \"<enter-your-project-id>\"  # Replace with your Google Cloud project ID\n",
    "REGION = \"us-central1\"\n",
    "CSV_PATH = \"./curate_evals_example.csv\"  # CSV of user questions loaded below\n",
    "TEST_RUN = False  # If True, use a tiny inline DataFrame instead of reading CSV_PATH\n",
    "CLUSTERING_NEIGHBORHOOD_SIZE = 5\n",
    "\n",
    "vertexai.init(\n",
    "    project=PROJECT_ID,\n",
    "    location=REGION\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7GLS_O8p1hSW",
   "metadata": {
    "id": "7GLS_O8p1hSW"
   },
   "source": [
    "## Prepare the dataset\n",
    "\n",
    "For this demo, we use a hypothetical dataset of questions about Google Cloud services."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "I9Ll-ZSrQtcI",
   "metadata": {
    "id": "I9Ll-ZSrQtcI"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "if TEST_RUN:\n",
    "    # Tiny inline dataset for a quick smoke test. The question column is named\n",
    "    # \"Question\" to match the CSV schema that the following cells rely on.\n",
    "    df = pd.DataFrame({\"Question\": [\"What is RAG?\", \"What is life?\", \"What is football?\", \"Who am I?\"],\n",
    "                       \"answer\": [\"Retrieval Augmented Generation\", \"Love\", \"National Football League\", \"Human\"]})\n",
    "else:\n",
    "    df = pd.read_csv(CSV_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cH5k_Mv0ABpW",
   "metadata": {
    "id": "cH5k_Mv0ABpW"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Topic</th>\n",
       "      <th>Question</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>How can I create a virtual machine instance on...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"What are the different machine types availabl...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"Can you explain the different pricing options...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"How do I connect to my Compute Engine instanc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"What are preemptible instances, and how can t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"How can I track and manage my Google Cloud co...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"What are the different pricing models for Goo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"How can I optimize my Google Cloud costs?\"</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"What tools are available for cost management ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"How can I set budgets and alerts for my Googl...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              Topic                                           Question\n",
       "0    Compute Engine  How can I create a virtual machine instance on...\n",
       "1    Compute Engine  \"What are the different machine types availabl...\n",
       "2    Compute Engine  \"Can you explain the different pricing options...\n",
       "3    Compute Engine  \"How do I connect to my Compute Engine instanc...\n",
       "4    Compute Engine  \"What are preemptible instances, and how can t...\n",
       "..              ...                                                ...\n",
       "95  Cost Management  \"How can I track and manage my Google Cloud co...\n",
       "96  Cost Management  \"What are the different pricing models for Goo...\n",
       "97  Cost Management        \"How can I optimize my Google Cloud costs?\"\n",
       "98  Cost Management  \"What tools are available for cost management ...\n",
       "99  Cost Management  \"How can I set budgets and alerts for my Googl...\n",
       "\n",
       "[100 rows x 2 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "Q9nFwjmP72Ka",
   "metadata": {
    "id": "Q9nFwjmP72Ka"
   },
   "source": [
    "### Dataset Preprocessing\n",
    "\n",
    "Queries sent to real-world RAG systems are often anomalous: single-word queries and typos are common. In this step, we preprocess and clean the dataset to remove the following types of queries:\n",
    "- Very short and very long queries\n",
    "- Near duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "l4H8oYXq6wOK",
   "metadata": {
    "id": "l4H8oYXq6wOK"
   },
   "outputs": [],
   "source": [
    "# Character count of each question, used for the length filter below\n",
    "df[\"question_len\"] = df[\"Question\"].str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "TBi-emAT_MdY",
   "metadata": {
    "id": "TBi-emAT_MdY"
   },
   "outputs": [],
   "source": [
    "# Discard questions with too few or too many characters\n",
    "# (keep only lengths strictly between 5 and 1000)\n",
    "df = df[(df.question_len > 5) & (df.question_len < 1000)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "xhgpghHBN7-E",
   "metadata": {
    "id": "xhgpghHBN7-E"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Topic</th>\n",
       "      <th>Question</th>\n",
       "      <th>question_len</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>How can I create a virtual machine instance on...</td>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"What are the different machine types availabl...</td>\n",
       "      <td>115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"Can you explain the different pricing options...</td>\n",
       "      <td>77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"How do I connect to my Compute Engine instanc...</td>\n",
       "      <td>59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Compute Engine</td>\n",
       "      <td>\"What are preemptible instances, and how can t...</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"How can I track and manage my Google Cloud co...</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"What are the different pricing models for Goo...</td>\n",
       "      <td>66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"How can I optimize my Google Cloud costs?\"</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"What tools are available for cost management ...</td>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>Cost Management</td>\n",
       "      <td>\"How can I set budgets and alerts for my Googl...</td>\n",
       "      <td>64</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              Topic                                           Question  \\\n",
       "0    Compute Engine  How can I create a virtual machine instance on...   \n",
       "1    Compute Engine  \"What are the different machine types availabl...   \n",
       "2    Compute Engine  \"Can you explain the different pricing options...   \n",
       "3    Compute Engine  \"How do I connect to my Compute Engine instanc...   \n",
       "4    Compute Engine  \"What are preemptible instances, and how can t...   \n",
       "..              ...                                                ...   \n",
       "95  Cost Management  \"How can I track and manage my Google Cloud co...   \n",
       "96  Cost Management  \"What are the different pricing models for Goo...   \n",
       "97  Cost Management        \"How can I optimize my Google Cloud costs?\"   \n",
       "98  Cost Management  \"What tools are available for cost management ...   \n",
       "99  Cost Management  \"How can I set budgets and alerts for my Googl...   \n",
       "\n",
       "    question_len  \n",
       "0             63  \n",
       "1            115  \n",
       "2             77  \n",
       "3             59  \n",
       "4             65  \n",
       "..           ...  \n",
       "95            51  \n",
       "96            66  \n",
       "97            43  \n",
       "98            63  \n",
       "99            64  \n",
       "\n",
       "[100 rows x 3 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48c5d108",
   "metadata": {},
   "source": [
    "#### Visualize distribution of question lengths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "YZP8uPVh4K97",
   "metadata": {
    "id": "YZP8uPVh4K97"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Axes: >"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAh8AAAGdCAYAAACyzRGfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAdpklEQVR4nO3df5DU9X348dfCnQtnOFSowOkhJGPHCERNiQ4x05qvIHWImqZjoxLLkE7SVFpFOkRMih4xRCFTSk0ciJlJ006DJplEk5hRcyGJlBGRn0lsMkim1FANUDVwyMV1c/f5/tFhx4MDjuOz7709H4+ZG73Pfvbzed+Lz949Z+/HFrIsywIAIJEhtV4AAPDWIj4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACCphlov4Ejd3d3x0ksvxYgRI6JQKNR6OQBAH2RZFgcPHoyWlpYYMuT4z20MuPh46aWXorW1tdbLAAD6Yffu3XHuueced58BFx8jRoyIiP9bfHNzc7+PUy6X4wc/+EFcddVV0djYmNfyOAbzTsu80zLvtMw7rbzm3dHREa2trZWv48cz4OLj8LdampubTzk+mpqaorm52cWbgHmnZd5pmXda5p1W3vPuy49M+IFTACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACR10vGxbt26uOaaa6KlpSUKhUI8+uijPW7PsizuuuuuGDduXAwfPjymT58eO3fuzGu9AECdO+n4OHToUFx00UXxwAMP9Hr78uXL4/7774/Vq1fHxo0b4/TTT4+ZM2fG66+/fsqLBQDq30m/sNzVV18dV199da+3ZVkWK1eujH/4h3+I6667LiIi/u3f/i3GjBkTjz76aNxwww2ntloAoO7l+qq2u3btij179sT06dMr20aOHBmXXXZZbNiwodf4KJVKUSqVKu93dHRExP+9yl65XO73Wg7f91SOQd+Zd1rmnZZ5p2XeaeU175O5f67xsWfPnoiIGDNmTI/tY8aMqdx2pHvvvTeWLFly1PYf/OAH0dTUdMpram9vP+Vj0HfmnZZ5p2XeaZl3Wqc6787Ozj7vm2t89Medd94ZCxYsqLzf0dERra2tcdVVV0Vzc3O/j1sul6O9vT1mzJgRjY2NeSyV46jWvCe3PZnLcZ5rm5nLcQYK13da5p2WeaeV17wPf+eiL3KNj7Fjx0ZExN69e2PcuHGV7Xv37o2LL7641/sUi8UoFotHbW9sbMzlosvrOPRN3vMudRVyOc5gvQZc32mZd1rmndapzvtk7pvr3/mYOHFijB07NtauXVvZ1tHRERs3boxp06bleSoAoE6d9DMfr732WvzqV7+qvL9r167Yvn17nHXWWTF+/PiYP39+fPazn43zzz8/Jk6cGIsXL46Wlpb44Ac/mOe6AYA6ddLxsXnz5nj/+99fef/wz2vMmTMnvvrVr8YnP/nJOHToUHz84x+P/fv3x/ve97544oknYtiwYfmtGgCoWycdH1dccUVkWXbM2wuFQnzmM5+Jz3zmM6e0MABgcPLaLgBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgqdzjo6urKxYvXhwTJ06M4cOHxzve8Y645557IsuyvE8FANShhrwP
uGzZsli1alX867/+a0yaNCk2b94cc+fOjZEjR8att96a9+kAgDqTe3w8/fTTcd1118WsWbMiImLChAnx0EMPxbPPPpv3qQCAOpR7fLz3ve+NBx98MJ5//vn4wz/8w/jpT38a69evjxUrVvS6f6lUilKpVHm/o6MjIiLK5XKUy+V+r+PwfU/lGPRdteZdHJrPt+sG23Xg+k7LvNMy77TymvfJ3L+Q5fzDGN3d3fGpT30qli9fHkOHDo2urq5YunRp3Hnnnb3u39bWFkuWLDlq+5o1a6KpqSnPpQEAVdLZ2Rk33XRTHDhwIJqbm4+7b+7x8fDDD8fChQvj85//fEyaNCm2b98e8+fPjxUrVsScOXOO2r+3Zz5aW1vj5ZdfPuHij6dcLkd7e3vMmDEjGhsb+30c+ubIeU9ue7LWS+rhubaZtV5CrlzfaZl3WuadVl7z7ujoiNGjR/cpPnL/tsvChQtj0aJFccMNN0RExJQpU+KFF16Ie++9t9f4KBaLUSwWj9re2NiYy0WX13Hom8PzLnUVar2UHgbrNeD6Tsu80zLvtE513idz39x/1bazszOGDOl52KFDh0Z3d3fepwIA6lDuz3xcc801sXTp0hg/fnxMmjQptm3bFitWrIiPfvSjeZ8KAKhDucfHF77whVi8eHHccsstsW/fvmhpaYm//uu/jrvuuivvUwEAdSj3+BgxYkSsXLkyVq5cmfehAYBBwGu7AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKQaar0A8jFh0fdrev7i0CyWXxoxue3JKHUVarqW3uQ1n/++b1Yux4GBwmODWvDMBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACCpqsTHiy++GB/5yEdi1KhRMXz48JgyZUps3ry5GqcCAOpMQ94H/O1vfxuXX355vP/974/HH388/uAP/iB27twZZ555Zt6nAgDqUO7xsWzZsmhtbY1/+Zd/qWybOHFi3qcBAOpU7vHx3e9+N2bOnBnXX399PPXUU3HOOefELbfcEh/72Md63b9UKkWpVKq839HRERER5XI5yuVyv9dx+L6ncox6Uhya1fb8Q7Ie/x2sBsr19Fa7vmttMM87r88dec5mMM97IMpr3idz/0KWZbl+tRg2bFhERCxYsCCuv/762LRpU9x2222xevXqmDNnzlH7t7W1xZIlS47avmbNmmhqaspzaQBAlXR2dsZNN90UBw4ciObm5uPum3t8nHbaaTF16tR4+umnK9tuvfXW2LRpU2zYsOGo/Xt75qO1tTVefvnlEy7+eMrlcrS3t8eMGTOisbGx38epF5Pbnqzp+YtDsrhnancs3jwkSt2Fmq6lmp5rm1nrJUTEW+/6rrXBPO+8Pnfk+dgYzPMeiPKad0dHR4wePbpP8ZH7t13GjRsXF154YY9t73znO+Nb3/pWr/sXi8UoFotHbW9sbMzlosvrOANdqWtgfMEvdRcGzFqqYaBdS2+V63ugGIzzzuvxWo25DMZ5D2SnOu+TuW/uv2p7+eWXx44dO3pse/755+O8887L+1QAQB3KPT5uv/32eOaZZ+Jzn/tc/OpXv4o1a9bEgw8+GPPmzcv7VABAHco9Pt7znvfEI488Eg899FBMnjw57rnnnli5cmXMnj0771MBAHUo95/5iIj4wAc+EB/4wAeqcWgAoM55bRcAICnxAQAkJT4AgKTEBwCQlPgAAJISHwBA
UuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQVEOtF/BWN2HR92u9BCARj/cTm7Do+1EcmsXySyMmtz0Zpa5Cv47z3/fNynll5MkzHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkqh4f9913XxQKhZg/f361TwUA1IGqxsemTZviS1/6UrzrXe+q5mkAgDpStfh47bXXYvbs2fHlL385zjzzzGqdBgCoMw3VOvC8efNi1qxZMX369PjsZz97zP1KpVKUSqXK+x0dHRERUS6Xo1wu9/v8h+97KsdIoTg0q/USclEckvX472A1UK6nerm+B4u85j1YHu+9yetaLA7Ncvl84rHRd3ld3ydz/0KWZbk/Gh5++OFYunRpbNq0KYYNGxZXXHFFXHzxxbFy5cqj9m1ra4slS5YctX3NmjXR1NSU99IAgCro7OyMm266KQ4cOBDNzc3H3Tf3+Ni9e3dMnTo12tvbKz/rcbz46O2Zj9bW1nj55ZdPuPjjKZfL0d7eHjNmzIjGxsbK9sltT/b7mG/2XNvMXI6T13pqrTgki3umdsfizUOi1F2o9XKqZqD8ux+e95HXd60N1sfXtk//v14/n9RqPQNRnv9mA+nzSV4f10B2rK+XJ6ujoyNGjx7dp/jI/dsuW7ZsiX379sW73/3uyraurq5Yt25dfPGLX4xSqRRDhw6t3FYsFqNYLB51nMbGxlw+qR55nFJXPhdyXp/w81rPQFHqLgy6j+nNBtq/e16Pk7wM1sfX4fWc6rw9Nk7szTMaCJ9PBtLjq9pO9fo+mfvmHh9XXnll/PznP++xbe7cuXHBBRfEHXfc0SM8AIC3ntzjY8SIETF58uQe204//fQYNWrUUdsBgLcef+EUAEiqar9q+2Y/+clPUpwGAKgDnvkAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACCphlovoF5NWPT9Wi8BSGRy25Ox/NL/+2+pq1Dr5QxIPidyMjzzAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEgq9/i499574z3veU+MGDEizj777PjgBz8YO3bsyPs0AECdyj0+nnrqqZg3b14888wz0d7eHuVyOa666qo4dOhQ3qcCAOpQQ94HfOKJJ3q8/9WvfjXOPvvs2LJlS/zxH/9x3qcDAOpM7vFxpAMHDkRExFlnndXr7aVSKUqlUuX9jo6OiIgol8tRLpf7fd7D9z3yGMWhWb+PybEVh2Q9/jtYnco1+Waneh0ennNe68lLXo+vgTLnynHeItf3QDGQ5j3QHmPVcKyvl/09Tl8Usiyr2r9ud3d3XHvttbF///5Yv359r/u0tbXFkiVLjtq+Zs2aaGpqqtbSAIAcdXZ2xk033RQHDhyI5ubm4+5b1fj4m7/5m3j88cdj/fr1ce655/a6T2/PfLS2tsbLL798wsUfT7lcjvb29pgxY0Y0NjZWtk9ue7Lfx+TYikOy
uGdqdyzePCRK3YVaL2fQOzzvI6/vWsvr8fVc28xcjpPXelzfaQ2keed1LQ5kx/p6ebI6Ojpi9OjRfYqPqn3b5W//9m/jsccei3Xr1h0zPCIiisViFIvFo7Y3Njbm8kn1yOOUunziqKZSd8GME8rrcZKXvP7t8/qY8r4WXd9pDYR5D6THV7Wd6ueTk7lv7vGRZVn83d/9XTzyyCPxk5/8JCZOnJj3KQCAOpZ7fMybNy/WrFkT3/nOd2LEiBGxZ8+eiIgYOXJkDB8+PO/TAQB1Jve/87Fq1ao4cOBAXHHFFTFu3LjK29e//vW8TwUA1KGqfNsFAOBYvLYLAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASKqh1gsA+m9y25NR6iqc8nH++75ZOawGOJYJi75f6yX0UOvHvGc+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAEmJDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKTEBwCQlPgAAJISHwBAUuIDAEhKfAAASYkPACAp8QEAJCU+AICkxAcAkJT4AACSEh8AQFLiAwBISnwAAElVLT4eeOCBmDBhQgwbNiwuu+yyePbZZ6t1KgCgjlQlPr7+9a/HggUL4u67746tW7fGRRddFDNnzox9+/ZV43QAQB2pSnysWLEiPvaxj8XcuXPjwgsvjNWrV0dTU1N85StfqcbpAIA60pD3Ad94443YsmVL3HnnnZVtQ4YMienTp8eGDRuO2r9UKkWpVKq8f+DAgYiIePXVV6NcLvd7HeVyOTo7O+OVV16JxsbGyvaG3x/q9zE5tobuLDo7u6OhPCS6ugu1Xs6gl/e8X3nllRxWld/ja6Ctx/Wd1kCa90C7FvPy5o/rWF8vT9bBgwcjIiLLshPvnOXsxRdfzCIie/rpp3tsX7hwYXbppZcetf/dd9+dRYQ3b968efPmbRC87d69+4StkPszHyfrzjvvjAULFlTe7+7ujldffTVGjRoVhUL/i7ejoyNaW1tj9+7d0dzcnMdSOQ7zTsu80zLvtMw7rbzmnWVZHDx4MFpaWk64b+7xMXr06Bg6dGjs3bu3x/a9e/fG2LFjj9q/WCxGsVjsse2MM87IbT3Nzc0u3oTMOy3zTsu80zLvtPKY98iRI/u0X+4/cHraaafFH/3RH8XatWsr27q7u2Pt2rUxbdq0vE8HANSZqnzbZcGCBTFnzpyYOnVqXHrppbFy5co4dOhQzJ07txqnAwDqSFXi48Mf/nD87//+b9x1112xZ8+euPjii+OJJ56IMWPGVON0vSoWi3H33Xcf9S0dqsO80zLvtMw7LfNOqxbzLmRZX34nBgAgH17bBQBISnwAAEmJDwAgKfEBACQ1qOLjvvvui0KhEPPnz69se/3112PevHkxatSoeNvb3hZ//ud/ftQfQOPkvPjii/GRj3wkRo0aFcOHD48pU6bE5s2bK7dnWRZ33XVXjBs3LoYPHx7Tp0+PnTt31nDF9aurqysWL14cEydOjOHDh8c73vGOuOeee3q8doJ599+6devimmuuiZaWligUCvHoo4/2uL0vs3311Vdj9uzZ0dzcHGeccUb81V/9Vbz22msJP4r6cbx5l8vluOOOO2LKlClx+umnR0tLS/zlX/5lvPTSSz2OYd59d6Lr+80+8YlPRKFQiJUrV/bYXq15D5r42LRpU3zpS1+Kd73rXT2233777fG9730vvvnNb8ZTTz0VL730UnzoQx+q0Srr329/+9u4/PLLo7GxMR5//PH4xS9+Ef/4j/8YZ555ZmWf5cuXx/333x+rV6+OjRs3xumnnx4zZ86M119/vYYrr0/Lli2L
VatWxRe/+MX45S9/GcuWLYvly5fHF77whco+5t1/hw4diosuuigeeOCBXm/vy2xnz54d//mf/xnt7e3x2GOPxbp16+LjH/94qg+hrhxv3p2dnbF169ZYvHhxbN26Nb797W/Hjh074tprr+2xn3n33Ymu78MeeeSReOaZZ3r9s+hVm/epv5Rc7R08eDA7//zzs/b29uxP/uRPsttuuy3Lsizbv39/1tjYmH3zm9+s7PvLX/4yi4hsw4YNNVptfbvjjjuy973vfce8vbu7Oxs7dmz2+c9/vrJt//79WbFYzB566KEUSxxUZs2alX30ox/tse1DH/pQNnv27CzLzDtPEZE98sgjlff7Mttf/OIXWURkmzZtquzz+OOPZ4VCIXvxxReTrb0eHTnv3jz77LNZRGQvvPBClmXmfSqONe//+Z//yc4555zsueeey84777zsn/7pnyq3VXPeg+KZj3nz5sWsWbNi+vTpPbZv2bIlyuVyj+0XXHBBjB8/PjZs2JB6mYPCd7/73Zg6dWpcf/31cfbZZ8cll1wSX/7ylyu379q1K/bs2dNj5iNHjozLLrvMzPvhve99b6xduzaef/75iIj46U9/GuvXr4+rr746Isy7mvoy2w0bNsQZZ5wRU6dOrewzffr0GDJkSGzcuDH5mgebAwcORKFQqLzel3nnq7u7O26++eZYuHBhTJo06ajbqznvmr+q7al6+OGHY+vWrbFp06ajbtuzZ0+cdtppR71Q3ZgxY2LPnj2JVji4/Nd//VesWrUqFixYEJ/61Kdi06ZNceutt8Zpp50Wc+bMqcz1yL9ma+b9s2jRoujo6IgLLrgghg4dGl1dXbF06dKYPXt2RIR5V1FfZrtnz544++yze9ze0NAQZ511lvmfotdffz3uuOOOuPHGGysvdmbe+Vq2bFk0NDTErbfe2uvt1Zx3XcfH7t2747bbbov29vYYNmxYrZfzltDd3R1Tp06Nz33ucxERcckll8Rzzz0Xq1evjjlz5tR4dYPPN77xjfja174Wa9asiUmTJsX27dtj/vz50dLSYt4MWuVyOf7iL/4isiyLVatW1Xo5g9KWLVvin//5n2Pr1q1RKBSSn7+uv+2yZcuW2LdvX7z73e+OhoaGaGhoiKeeeiruv//+aGhoiDFjxsQbb7wR+/fv73G/vXv3xtixY2uz6Do3bty4uPDCC3tse+c73xm//vWvIyIqcz3yN4rMvH8WLlwYixYtihtuuCGmTJkSN998c9x+++1x7733RoR5V1NfZjt27NjYt29fj9t///vfx6uvvmr+/XQ4PF544YVob2/v8RLv5p2f//iP/4h9+/bF+PHjK18/X3jhhfj7v//7mDBhQkRUd951HR9XXnll/PznP4/t27dX3qZOnRqzZ8+u/H9jY2OsXbu2cp8dO3bEr3/965g2bVoNV16/Lr/88tixY0ePbc8//3ycd955ERExceLEGDt2bI+Zd3R0xMaNG828Hzo7O2PIkJ4P06FDh0Z3d3dEmHc19WW206ZNi/3798eWLVsq+/zoRz+K7u7uuOyyy5Kvud4dDo+dO3fGD3/4wxg1alSP2807PzfffHP87Gc/6/H1s6WlJRYuXBhPPvlkRFR53qf046oD0Jt/2yXLsuwTn/hENn78+OxHP/pRtnnz5mzatGnZtGnTarfAOvfss89mDQ0N2dKlS7OdO3dmX/va17Kmpqbs3//93yv73HfffdkZZ5yRfec738l+9rOfZdddd102ceLE7He/+10NV16f5syZk51zzjnZY489lu3atSv79re/nY0ePTr75Cc/WdnHvPvv4MGD2bZt27Jt27ZlEZGtWLEi27ZtW+W3K/oy2z/90z/NLrnkkmzjxo3Z+vXrs/PPPz+78cYba/UhDWjHm/cbb7yRXXvttdm5556bbd++PfvNb35TeSuVSpVjmHffnej6PtKRv+2SZdWb96CPj9/97nfZLbfckp155plZU1NT9md/9mfZb37zm9ot
cBD43ve+l02ePDkrFovZBRdckD344IM9bu/u7s4WL16cjRkzJisWi9mVV16Z7dixo0arrW8dHR3Zbbfdlo0fPz4bNmxY9va3vz379Kc/3eOTsXn3349//OMsIo56mzNnTpZlfZvtK6+8kt14443Z2972tqy5uTmbO3dudvDgwRp8NAPf8ea9a9euXm+LiOzHP/5x5Rjm3Xcnur6P1Ft8VGvehSx7059KBACosrr+mQ8AoP6IDwAgKfEBACQlPgCApMQHAJCU+AAAkhIfAEBS4gMASEp8AABJiQ8AICnxAQAkJT4AgKT+PwNz9/133e9BAAAAAElFTkSuQmCC",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df.question_len.hist(bins=25)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "lIB2TGLKRiWu",
   "metadata": {
    "id": "lIB2TGLKRiWu"
   },
   "source": [
    "## Generating the embeddings for the questions\n",
    "\n",
    "Vertex AI embeddings models can generate optimized embeddings for various task types, such as document retrieval, question and answering, and fact verification. Task types are labels that optimize the embeddings that the model generates based on your intended use case.\n",
    "\n",
    "In this example, we will set the `TASK_TYPE` as `RETRIEVAL_DOCUMENT` as this is used to generate embeddings that are optimized for information retrieval\n",
    "\n",
    "Read more about the various `TASK_TYPE` offered by Vertex AI Embedding models [here](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "VSOR8rf3TZDq",
   "metadata": {
    "id": "VSOR8rf3TZDq"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 100/100 [00:00<00:00, 302.49it/s]\n"
     ]
    }
   ],
   "source": [
    "import asyncio\n",
    "from tqdm.asyncio import tqdm_asyncio\n",
    "from typing import List, Optional,  Tuple\n",
    "from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel\n",
    "from google.cloud import storage\n",
    "from vertexai.generative_models import GenerativeModel\n",
    "\n",
    "async def embed_text_async(\n",
    "    model: TextEmbeddingModel,\n",
    "    texts: List[str] = [\"banana muffins? \", \"banana bread? banana muffins?\"],\n",
    "    task: str = \"RETRIEVAL_DOCUMENT\",\n",
    "    dimensionality: Optional[int] = 768,):\n",
    "    inputs = [TextEmbeddingInput(text, task) for text in texts]\n",
    "    kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}\n",
    "    embeddings = await model.get_embeddings_async(texts, **kwargs)\n",
    "    return [embedding.values for embedding in embeddings]\n",
    "\n",
    "# embedding model to use\n",
    "model_name = \"text-embedding-005\"\n",
    "embedding_model = TextEmbeddingModel.from_pretrained(model_name)\n",
    "\n",
    "# embed questions from the dataset asynchronously\n",
    "embedded_qs = await tqdm_asyncio.gather(*[embed_text_async(embedding_model,\n",
    "                                        [x[\"Question\"]]) for i, x in df.iterrows()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "a3pUtDfnU_UM",
   "metadata": {
    "id": "a3pUtDfnU_UM"
   },
   "outputs": [],
   "source": [
    "embedded_qs_flattened = [q[0] for q in embedded_qs]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "kgTbuokcRl1L",
   "metadata": {
    "id": "kgTbuokcRl1L"
   },
   "source": [
    "## Cluster the Questions\n",
    "\n",
    "While various clustering algorithms can be applied, Louvain community detection is a particularly suitable choice for this task due to its speed and effectiveness."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "NAlsWHDXz5v9",
   "metadata": {
    "id": "NAlsWHDXz5v9"
   },
   "source": [
    "### Vector-based Retrieval Clustering\n",
    "1. Store your embedded question set in a vector index\n",
    "2. Query the vector index with each question in the dataset, retrieving a topk-sized neighborhood of questions around the query question.\n",
    "3. Form a graph of questions by adding an edge between the query question and each of the retrieved questions\n",
    "4. Perform Louvain or Leiden community detection on the graph to create clusters of questions"
   ]
  },
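  {
   "cell_type": "markdown",
   "id": "sketchKnnEdges",
   "metadata": {},
   "source": [
    "Steps 2 and 3 above can be sketched without a vector store by using brute-force cosine similarity. This is only an illustration under simplifying assumptions: `cosine` and `knn_edges` are hypothetical helpers, the vectors are made-up 2-D values rather than real embeddings, and self-matches are skipped.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from typing import Dict, List, Set, Tuple\n",
    "\n",
    "\n",
    "def cosine(a: List[float], b: List[float]) -> float:\n",
    "    # Cosine similarity between two equal-length vectors.\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))\n",
    "    return dot / norm if norm else 0.0\n",
    "\n",
    "\n",
    "def knn_edges(vectors: Dict[str, List[float]], top_k: int) -> Set[Tuple[str, str]]:\n",
    "    # Link every item to its top_k most similar other items.\n",
    "    edges = set()\n",
    "    for name, vec in vectors.items():\n",
    "        scored = sorted(\n",
    "            ((cosine(vec, other), other_name)\n",
    "             for other_name, other in vectors.items() if other_name != name),\n",
    "            reverse=True,\n",
    "        )\n",
    "        for _, neighbor in scored[:top_k]:\n",
    "            # Store each undirected edge once, with endpoints in sorted order.\n",
    "            edges.add(tuple(sorted((name, neighbor))))\n",
    "    return edges\n",
    "\n",
    "\n",
    "# Toy vectors standing in for question embeddings (made-up values).\n",
    "toy_vectors = {\n",
    "    'vm_create': [1.0, 0.1],\n",
    "    'vm_ssh': [0.9, 0.2],\n",
    "    'bq_load': [0.1, 1.0],\n",
    "    'bq_query': [0.2, 0.9],\n",
    "}\n",
    "toy_edges = knn_edges(toy_vectors, top_k=1)\n",
    "```\n",
    "\n",
    "The resulting edge set could then be loaded into a graph with `G.add_edges_from(toy_edges)` before running community detection."
   ]
  },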
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "ML6i053tW3vQ",
   "metadata": {
    "collapsed": true,
    "id": "ML6i053tW3vQ",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 100/100 [00:01<00:00, 77.81it/s]\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core import (\n",
    "    VectorStoreIndex,\n",
    "    Settings,\n",
    "    SimpleDirectoryReader,\n",
    "    load_index_from_storage,\n",
    "    StorageContext,\n",
    "    Document\n",
    ")\n",
    "from llama_index.llms.vertex import Vertex\n",
    "from llama_index.embeddings.vertex import VertexTextEmbedding\n",
    "from vertexai.generative_models import HarmCategory, HarmBlockThreshold\n",
    "import networkx as nx\n",
    "from community import community_louvain # pip install python-louvain\n",
    "import google.auth\n",
    "import google.auth.transport.requests\n",
    "\n",
    "credentials = google.auth.default()[0]\n",
    "request = google.auth.transport.requests.Request()\n",
    "credentials.refresh(request)\n",
    "\n",
    "\n",
    "query_list = df[\"Question\"].tolist()\n",
    "query_docs = [Document(text=t) for t in query_list] # To make it LlamaIndex compatible\n",
    "embed_model = VertexTextEmbedding(credentials=credentials, model_name=\"text-embedding-005\")\n",
    "llm = Vertex(model=\"gemini-2.0-flash-001\",\n",
    "             temperature=0.2,\n",
    "             max_tokens=8192,\n",
    "             safety_settings={\n",
    "                    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,\n",
    "                    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,\n",
    "                    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,\n",
    "                    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,\n",
    "        }\n",
    ")\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model\n",
    "\n",
    "\n",
    "# Form a local vector index with all our questions\n",
    "vector_index = VectorStoreIndex.from_documents(query_docs)\n",
    "vector_retriever = vector_index.as_retriever(similarity_top_k=CLUSTERING_NEIGHBORHOOD_SIZE)\n",
    "\n",
    "\n",
    "# Create a similarity graph\n",
    "G = nx.Graph()\n",
    "\n",
    "# Get a neighborhood of similar questions by querying the vector index\n",
    "similar_texts = await tqdm_asyncio.gather(*[vector_retriever.aretrieve(text) for i, text in enumerate(query_list)])\n",
    "\n",
    "for i, text in enumerate(query_list):\n",
    "  for s in similar_texts[i]:\n",
    "    G.add_edge(text, s.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "UKhkdk7UcV88",
   "metadata": {
    "collapsed": true,
    "id": "UKhkdk7UcV88",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Apply Louvain Community Detection\n",
    "partition = community_louvain.best_partition(G)\n",
    "df[\"cluster_idx\"] = df[\"Question\"].map(partition)"
   ]
  },
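  {
   "cell_type": "markdown",
   "id": "sketchGroupPartition",
   "metadata": {},
   "source": [
    "`best_partition` returns a dictionary that maps each graph node (here, the question text) to an integer community id, which is why `df[\"Question\"].map(partition)` yields a cluster index per row. A small sketch of inverting such a mapping into clusters, using a made-up partition and a hypothetical helper `group_by_cluster`:\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "from typing import Dict, List\n",
    "\n",
    "\n",
    "def group_by_cluster(partition: Dict[str, int]) -> Dict[int, List[str]]:\n",
    "    # Invert a node -> cluster id mapping into cluster id -> list of nodes.\n",
    "    clusters = defaultdict(list)\n",
    "    for node, cluster_id in partition.items():\n",
    "        clusters[cluster_id].append(node)\n",
    "    return dict(clusters)\n",
    "\n",
    "\n",
    "# Made-up partition in the same shape that best_partition returns.\n",
    "toy_partition = {'vm_create': 0, 'vm_ssh': 0, 'bq_load': 1}\n",
    "toy_clusters = group_by_cluster(toy_partition)\n",
    "```\n",
    "\n",
    "Because the graph nodes are keyed by question text, identical question strings share a single node and therefore always land in the same cluster."
   ]
  },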
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ryVyUMhCcb-o",
   "metadata": {
    "id": "ryVyUMhCcb-o"
   },
   "outputs": [],
   "source": [
    "grouped_df = pd.DataFrame(df.groupby(\"cluster_idx\")['Question'].apply(list)).reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "kyapkNuvKRZZ",
   "metadata": {
    "collapsed": true,
    "id": "kyapkNuvKRZZ",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cluster_idx</th>\n",
       "      <th>Question</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>[How can I create a virtual machine instance o...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>[\"What is BigQuery, and how can I use it to an...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>[\"What are the different tools available for d...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>[\"I need to increase the storage space on my C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>[\"My application is experiencing performance i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>[\"What are preemptible instances, and how can ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>[\"What is Cloud Load Balancing, and how does i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>[\"I need to transfer a large amount of data to...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>[\"I'm trying to train a machine learning model...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>[\"I'm concerned about the security of my sensi...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cluster_idx                                           Question\n",
       "0            0  [How can I create a virtual machine instance o...\n",
       "1            1  [\"What is BigQuery, and how can I use it to an...\n",
       "2            2  [\"What are the different tools available for d...\n",
       "3            3  [\"I need to increase the storage space on my C...\n",
       "4            4  [\"My application is experiencing performance i...\n",
       "5            5  [\"What are preemptible instances, and how can ...\n",
       "6            6  [\"What is Cloud Load Balancing, and how does i...\n",
       "7            7  [\"I need to transfer a large amount of data to...\n",
       "8            8  [\"I'm trying to train a machine learning model...\n",
       "9            9  [\"I'm concerned about the security of my sensi..."
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "K5UnIAP7KEX5",
   "metadata": {
    "id": "K5UnIAP7KEX5"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cluster_idx</th>\n",
       "      <th>Question</th>\n",
       "      <th>num_questions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>[How can I create a virtual machine instance o...</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>[\"What is BigQuery, and how can I use it to an...</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>[\"What are the different tools available for d...</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>[\"I need to increase the storage space on my C...</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>[\"My application is experiencing performance i...</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>[\"What are preemptible instances, and how can ...</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>[\"What is Cloud Load Balancing, and how does i...</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>[\"I need to transfer a large amount of data to...</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>[\"I'm trying to train a machine learning model...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>[\"I'm concerned about the security of my sensi...</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cluster_idx                                           Question  \\\n",
       "0            0  [How can I create a virtual machine instance o...   \n",
       "1            1  [\"What is BigQuery, and how can I use it to an...   \n",
       "2            2  [\"What are the different tools available for d...   \n",
       "3            3  [\"I need to increase the storage space on my C...   \n",
       "4            4  [\"My application is experiencing performance i...   \n",
       "5            5  [\"What are preemptible instances, and how can ...   \n",
       "6            6  [\"What is Cloud Load Balancing, and how does i...   \n",
       "7            7  [\"I need to transfer a large amount of data to...   \n",
       "8            8  [\"I'm trying to train a machine learning model...   \n",
       "9            9  [\"I'm concerned about the security of my sensi...   \n",
       "\n",
       "   num_questions  \n",
       "0              9  \n",
       "1              5  \n",
       "2             12  \n",
       "3             11  \n",
       "4              8  \n",
       "5             13  \n",
       "6             12  \n",
       "7              6  \n",
       "8              4  \n",
       "9             20  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_df[\"num_questions\"] = grouped_df[\"Question\"].apply(len)\n",
    "grouped_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "RzT0fZC-RsaB",
   "metadata": {
    "id": "RzT0fZC-RsaB"
   },
   "source": [
    "## Analyze Clusters Using Gemini\n",
    "\n",
    "We can use Gemini to extract summaries, topics, relevant questions, sentiment or any other required information from the cluster.\n",
    "This allows us to quickly identify higher level patterns about the various questions from users, understand different user problems and much more insightful information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "8Usbh9JmRvpB",
   "metadata": {
    "id": "8Usbh9JmRvpB"
   },
   "outputs": [],
   "source": [
    "from vertexai.generative_models import GenerativeModel, GenerationConfig\n",
    "from vertexai.generative_models import HarmCategory, HarmBlockThreshold\n",
    "from llama_index.core.program import LLMTextCompletionProgram\n",
    "from llama_index.core.output_parsers import PydanticOutputParser\n",
    "from pydantic import BaseModel, Field\n",
    "from typing import Annotated\n",
    "from enum import Enum\n",
    "from annotated_types import Len\n",
    "\n",
    "num_clusters = grouped_df.shape[0]\n",
    "\n",
    "class Sentiment(Enum):\n",
    "  POSITIVE = \"positive\"\n",
    "  NEGATIVE = \"negative\"\n",
    "  NEUTRAL = \"neutral\"\n",
    "\n",
    "class ClusterSummary(BaseModel):\n",
    "  '''A cluster summary, list of topics, most representative questions, and sentiment associated with a cluster of questions from chat sessions.'''\n",
    "  summary_desc: str\n",
    "  topics: List[str]\n",
    "  most_representative_qs: Annotated[List[str], Len(3, 8)]\n",
    "  sentiment: Sentiment\n",
    "\n",
    "\n",
    "boring_prompt = \"\"\"Please provide a brief summary which captures the nature of the given cluster of questions below in the form of \"Questions concerning ____\".\n",
    "                  \\n Cluster questions:\n",
    "                  \\n {questions_list}\n",
    "                  \\n The clusters titles should not be generic such as \"Google Cloud AI\" or \"Gemini\".\n",
    "                  \\n They need to be specific in order to distinguish the clusters from others which may be similar.\n",
    "                  \\n Also include a list of topic phrases which the questions address, the most representative questions of the cluster, and an overall sentiment. Be sure to follow a consistent format.\"\"\"\n",
    "\n",
    "movie_prompt = \"\"\"You are an expert movie producer for famous movies.\n",
    "                  \\n Please provide a quipy, movie title which captures the essence of the given cluster of questions below.\n",
    "                  \\n Example:\n",
    "                  \\n How does RAG work on Vertex?\n",
    "                  \\n Where can I find documentation on Vertex AI Generative model API?\n",
    "                  \\n What are the pitfals of Gemini vs. Gemma?\n",
    "                  \\n Answer:\n",
    "                  \\n movie title: \"Into the Vertex\"\n",
    "                  \\n representative qs: How does RAG work on Vertex?\n",
    "                  \\n topics: Vertex AI, Vertex AI Generative Model\n",
    "                  \\n sentiment: neutral\n",
    "                  \\n Cluster questions:\n",
    "                  \\n {questions_list}\n",
    "                  \\n Also include a list of topic phrases which the questions address, the most representative questions of the cluster, and an overall sentiment. Be sure to follow a consistent format. \"\"\"\n",
    "\n",
    "async def summarize_cluster(questions: List[str]):\n",
    "  questions_list = \"\\n\".join(questions)\n",
    "  llm_program = LLMTextCompletionProgram.from_defaults(\n",
    "        output_parser=PydanticOutputParser(ClusterSummary),\n",
    "        prompt_template_str=boring_prompt,\n",
    "        verbose=True,\n",
    "    )\n",
    "  try:\n",
    "    cluster_summary = await llm_program.acall(questions_list=questions_list)\n",
    "  except Exception as e:\n",
    "    print(e)\n",
    "    return None\n",
    "  return cluster_summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "9EUj5xwpgMzU",
   "metadata": {
    "id": "9EUj5xwpgMzU"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [00:05<00:00,  1.70it/s]\n"
     ]
    }
   ],
   "source": [
    "# Summarize each cluster individually\n",
    "cluster_summaries = await tqdm_asyncio.gather(*[summarize_cluster(q[\"Question\"]) for idx, q in grouped_df.iterrows()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3l0OnOUO9pXh",
   "metadata": {
    "collapsed": true,
    "id": "3l0OnOUO9pXh",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[ClusterSummary(summary_desc='Questions concerning the practical usage and troubleshooting of Google Compute Engine virtual machine instances, including instance creation, selection, connection, deletion recovery, clustering for high availability, firewall issues, and pricing.', topics=['Google Compute Engine', 'Virtual Machine Instances', 'Instance Creation', 'Machine Types', 'Pricing', 'SSH Connection', 'Instance Deletion Recovery', 'High Availability Clustering', 'Firewall Troubleshooting', 'Discounts'], most_representative_qs=['How can I create a virtual machine instance on Compute Engine?', 'What are the different machine types available on Compute Engine, and how do I choose the right one for my needs?', 'Can you explain the different pricing options for Compute Engine instances?', 'How do I connect to my Compute Engine instance using SSH?', 'I accidentally deleted my Compute Engine instance. How can I recover it?', 'I want to set up a cluster of Compute Engine instances for high availability. Can you guide me through the process?', \"I'm having trouble connecting to my Virtual Machine instance. I think there's a firewall issue. How can I troubleshoot this?\"], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc='Questions concerning the practical application and optimization of BigQuery for large dataset analysis, including data loading, query performance, visualization, and machine learning integration.', topics=['BigQuery', 'large datasets', 'data analysis', 'data loading', 'query optimization', 'performance', 'visualization', 'Data Studio', 'machine learning'], most_representative_qs=['What is BigQuery, and how can I use it to analyze large datasets?', 'I have a large dataset that I want to analyze using BigQuery. How can I load my data into BigQuery?', 'My BigQuery queries are taking a long time to run. How can I optimize my queries for better performance?', 'I want to visualize my data in BigQuery using Data Studio. How can I connect Data Studio to my BigQuery dataset?', 'How can I use machine learning with BigQuery to gain insights from my data?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc=\"Questions concerning the practical application of Google Cloud's Machine Learning and Data Processing tools for building, deploying, and monitoring models.\", topics=['Data Processing on Google Cloud', 'Data Pipelines', 'Data Visualization', 'Real-time Data Processing', 'Pre-trained Models for Image Recognition', 'AutoML', 'Machine Learning Model Deployment', 'Machine Learning Model Monitoring', 'AI and ML Services on Google Cloud'], most_representative_qs=['How can I use Google Cloud to build a data pipeline?', 'How can I use Google Cloud to visualize my data?', 'How can I use AutoML to build a machine learning model without writing any code?', 'I need to monitor the performance of my deployed machine learning model. What tools are available on Google Cloud?', 'How can I use Google Cloud to build a machine learning model?', 'How can I use Google Cloud to deploy my machine learning model?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc='Questions concerning Google Cloud database services, particularly storage, migration, scaling, and performance optimization for Cloud SQL and Cloud Spanner.', topics=['Google Cloud Storage', 'Database Services', 'Cloud SQL', 'Cloud Spanner', 'Database Migration', 'Database Scaling', 'Storage Capacity', 'Database Replication', 'Query Performance', 'Database Security'], most_representative_qs=['What database services are available on Google Cloud?', 'What is the difference between Cloud SQL and Cloud Spanner?', 'How do I migrate my existing database to Google Cloud?', 'How can I scale my database on Google Cloud?', 'My Cloud SQL database is running out of storage space. How can I increase the storage capacity?', 'I need to replicate my Cloud SQL database to another region for disaster recovery. How can I set up database replication?', \"I'm experiencing slow query performance on my Cloud Spanner database. How can I optimize my database and queries?\"], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc='Questions concerning the monitoring, troubleshooting, and optimization of application performance on Google Cloud Platform, particularly focusing on diagnosing latency, utilizing logging and monitoring tools, and understanding specific services like Compute Engine and Cloud Run.', topics=['application performance', 'troubleshooting', 'optimization', 'Compute Engine', 'network latency', 'Google Kubernetes Engine', 'Cloud Logging', 'Cloud Monitoring', 'Cloud Run'], most_representative_qs=['My application is experiencing performance issues. How can I troubleshoot and optimize my Compute Engine instance?', 'My application is experiencing high latency. Could it be a networking issue? How can I diagnose and resolve network latency problems?', 'I want to monitor the performance of my applications running on Google Kubernetes Engine. What tools can I use?', 'How can I monitor the performance and logs of my Cloud Run services?', 'What is Cloud Monitoring, and how does it work?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc='Questions concerning cost optimization strategies and troubleshooting unexpected expenses within Google Cloud Platform.', topics=['Preemptible Instances', 'Cost Optimization', 'Unused Resources', 'Cloud Storage Costs', 'High CPU Usage', 'Cost Allocation', 'Budgeting', 'Cost Analysis', 'Resource Management'], most_representative_qs=['What are preemptible instances, and how can they save me money?', \"I'm getting billed for a Compute Engine instance that I'm not using. How can I identify and shut down unused instances?\", 'My Cloud Storage costs are higher than expected. How can I analyze my usage and optimize my storage costs?', 'My Google Cloud bill is higher than expected this month. How can I identify the source of the increased cost?', 'I want to track the cost of my Google Cloud resources by department. How can I set up cost allocation?', 'What are some best practices for optimizing my Google Cloud costs?'], sentiment=<Sentiment.NEGATIVE: 'negative'>),\n",
       " ClusterSummary(summary_desc='Questions concerning the automation of application deployment, management, and scaling on Google Cloud, particularly focusing on serverless technologies like Cloud Functions and Cloud Run.', topics=['Cloud Load Balancing', 'application deployment automation', 'Google Cloud resource management', 'task automation on Google Cloud', 'serverless computing', 'serverless platforms on Google Cloud', 'serverless application development and deployment', 'Cloud Functions', 'Google Cloud Run', 'containerized applications', 'API development with Cloud Functions', 'Cloud Function timeout limits', 'Cloud Run deployment configuration'], most_representative_qs=['I need to automate the deployment of my applications on Google Cloud. What tools and services can I use?', 'What tools are available for managing my Google Cloud resources?', 'How can I automate tasks on Google Cloud?', 'What is serverless computing, and what are its benefits?', 'What serverless platforms are available on Google Cloud?', 'How can I build and deploy a serverless application on Google Cloud?', 'What is Cloud Functions, and how does it work?', 'How can I use Google Cloud Run to deploy containerized applications?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc='Questions concerning the practical aspects of using Google Cloud Storage, such as data transfer methods, file recovery, access management, and pricing.', topics=['data transfer', 'file recovery', 'public access', 'data upload', 'data access', 'pricing'], most_representative_qs=[\"I need to transfer a large amount of data to Google Cloud Storage. What's the most efficient way to do this?\", 'I accidentally deleted some files from my Cloud Storage bucket. How can I recover them?', 'I want to make my data in Cloud Storage available to the public. How can I configure public access?', 'How much does it cost to store data in Google Cloud Storage?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc='Questions concerning practical challenges and usage of Vertex AI for machine learning tasks.', topics=['Vertex AI', 'Machine Learning', 'Model Training', 'Error Troubleshooting', 'Model Deployment', 'API', 'IAM Configuration'], most_representative_qs=[\"I'm trying to train a machine learning model on Vertex AI, but I'm getting errors. How can I troubleshoot these errors?\", 'I want to deploy my trained machine learning model as an API. How can I do this using Vertex AI?', 'What is Vertex AI, and how can I use it?', \"I'm having trouble configuring IAM to use Vertex AI. What do I do?\"], sentiment=<Sentiment.NEUTRAL: 'neutral'>),\n",
       " ClusterSummary(summary_desc='Questions concerning securing Google Cloud resources and infrastructure, particularly focusing on networking, access control, and data protection.', topics=['Cloud Storage Security', 'Virtual Private Cloud (VPC)', 'Firewall Rules', 'Network Connectivity', 'Network Security', 'Data Security', 'Identity and Access Management (IAM)', 'Multi-Factor Authentication', 'Security Best Practices', 'Security Monitoring', 'Vulnerability Remediation', 'Secure Application Development'], most_representative_qs=['What are the security features of Google Cloud Storage?', 'How do I create a Virtual Private Cloud (VPC) on Google Cloud?', 'What are firewalls, and how do I configure them in Google Cloud?', 'How can I connect my on-premises network to Google Cloud?', 'How can I secure my applications and data on Google Cloud?', 'What is Identity and Access Management (IAM), and how does it work?', 'How can I implement multi-factor authentication on Google Cloud?', 'What are security best practices for Google Cloud?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cluster_summaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "k0U7a6nqZeMP",
   "metadata": {
    "id": "k0U7a6nqZeMP"
   },
   "outputs": [],
   "source": [
    "just_summaries = [c.summary_desc if c else None for c in cluster_summaries]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "oLp3ceBVSEo2",
   "metadata": {
    "id": "oLp3ceBVSEo2"
   },
   "outputs": [],
   "source": [
    "df_grouped_by_cluster = df.groupby(\"cluster_idx\").agg(\"count\")\n",
    "df_grouped_by_cluster[\"cluster_summary\"] = cluster_summaries\n",
    "df_grouped_by_cluster[\"just_summary\"] = just_summaries\n",
    "df_grouped_by_cluster[\"questions_list\"] = grouped_df[\"Question\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "tt3kS_tPA6XL",
   "metadata": {
    "id": "tt3kS_tPA6XL"
   },
   "outputs": [],
   "source": [
    "from fasthtml.common import *\n",
    "from fasthtml.fastapp import *\n",
    "from random import sample\n",
    "from fasthtml.components import Zero_md\n",
    "\n",
    "# Load Tailwind CSS and daisyUI from their CDNs to style the cards\n",
    "tlink = Script(src=\"https://cdn.tailwindcss.com\")\n",
    "dlink = Link(rel=\"stylesheet\", href=\"https://cdn.jsdelivr.net/npm/daisyui@4.11.1/dist/full.min.css\")\n",
    "app = FastHTML(hdrs=(dlink, tlink))\n",
    "\n",
    "def Markdown(md, css = ''):\n",
    "    css_template = Template(Style(css), data_append=True)\n",
    "    return Zero_md(css_template, Script(md, type=\"text/markdown\"))\n",
    "\n",
    "def MarkdownWOutBackground(md: str):\n",
    "    css = '.markdown-body {background-color: unset !important; color: unset !important;} .markdown-body table {color: black !important;}'\n",
    "    markdown_wout_background = partial(Markdown, css=css)\n",
    "    return markdown_wout_background(md)\n",
    "\n",
    "def stat_card(num_questions: int):\n",
    "  return Div(\n",
    "    Div('Total Questions', cls='stat-title'),\n",
    "    Div(f'{num_questions}', cls='stat-value'),\n",
    "    cls='stat'\n",
    "  )\n",
    "\n",
    "def cluster_card(cluster_summary: ClusterSummary, questions_list: List[str]):\n",
    "  # Map the cluster sentiment to a daisyUI badge color\n",
    "  if cluster_summary.sentiment == Sentiment.NEGATIVE:\n",
    "    badge_color = \"error\"\n",
    "  elif cluster_summary.sentiment == Sentiment.NEUTRAL:\n",
    "    badge_color = \"neutral\"\n",
    "  else:\n",
    "    badge_color = \"success\"\n",
    "  return Div(\n",
    "              Div(\n",
    "                  H2(cluster_summary.summary_desc, cls='card-title'),\n",
    "                  Div(\n",
    "                      stat_card(len(questions_list)),\n",
    "                      Div(cluster_summary.sentiment, cls=f'badge badge-{badge_color}'),\n",
    "                      cls=\"flex flex-row items-center\"\n",
    "                  ),\n",
    "                  H4(\"Representative Questions:\", cls=\"font-bold\"),\n",
    "                  Ul(\n",
    "                      *[Li(q) for q in cluster_summary.most_representative_qs],\n",
    "                      cls='list-disc list-inside mt-2'\n",
    "                  ),\n",
    "                  H4(\"Topics Discussed:\", cls=\"font-bold\"),\n",
    "                  Ul(\n",
    "                      *[Li(t) for t in cluster_summary.topics],\n",
    "                      cls='list-disc list-inside mt-2'\n",
    "                  ),\n",
    "                  cls='card-body'\n",
    "              ),\n",
    "              cls='card bg-base-100 shadow-xl'\n",
    "          )\n",
    "\n",
    "@app.get(\"/\")\n",
    "def cluster_analysis():\n",
    "    return Div(\n",
    "              *[cluster_card(c, q) for c, q in zip(cluster_summaries, df_grouped_by_cluster[\"questions_list\"])],\n",
    "              cls=\"grid grid-cols-2 gap-2\"\n",
    "            )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "DjLQVD7MRVn1",
   "metadata": {
    "id": "DjLQVD7MRVn1"
   },
   "source": [
    "## Gemini-generated Cluster Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "1a4ZayTxHe8V",
   "metadata": {
    "id": "1a4ZayTxHe8V"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       " <!doctype html>\n",
       " <html>\n",
       "   <head>\n",
       "<title>FastHTML page</title>     <meta charset=\"utf-8\">\n",
       "     <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, viewport-fit=cover\">\n",
       "<script src=\"https://unpkg.com/htmx.org@next/dist/htmx.min.js\"></script><script src=\"https://cdn.jsdelivr.net/gh/answerdotai/fasthtml-js@1.0.4/fasthtml.js\"></script><script src=\"https://cdn.jsdelivr.net/gh/answerdotai/surreal@main/surreal.js\"></script><script src=\"https://cdn.jsdelivr.net/gh/gnat/css-scope-inline@main/script.js\"></script>     <link rel=\"stylesheet\" href=\"https://cdn.jsdelivr.net/npm/daisyui@4.11.1/dist/full.min.css\">\n",
       "<script src=\"https://cdn.tailwindcss.com\"></script>   </head>\n",
       "   <body>\n",
       "     <div class=\"grid grid-cols-2 gap-2\">\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning the practical usage and troubleshooting of Google Compute Engine virtual machine instances, including instance creation, selection, connection, deletion recovery, clustering for high availability, firewall issues, and pricing.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">9</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>How can I create a virtual machine instance on Compute Engine?</li>\n",
       "             <li>What are the different machine types available on Compute Engine, and how do I choose the right one for my needs?</li>\n",
       "             <li>Can you explain the different pricing options for Compute Engine instances?</li>\n",
       "             <li>How do I connect to my Compute Engine instance using SSH?</li>\n",
       "             <li>I accidentally deleted my Compute Engine instance. How can I recover it?</li>\n",
       "             <li>I want to set up a cluster of Compute Engine instances for high availability. Can you guide me through the process?</li>\n",
       "             <li>I&#x27;m having trouble connecting to my Virtual Machine instance. I think there&#x27;s a firewall issue. How can I troubleshoot this?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>Google Compute Engine</li>\n",
       "             <li>Virtual Machine Instances</li>\n",
       "             <li>Instance Creation</li>\n",
       "             <li>Machine Types</li>\n",
       "             <li>Pricing</li>\n",
       "             <li>SSH Connection</li>\n",
       "             <li>Instance Deletion Recovery</li>\n",
       "             <li>High Availability Clustering</li>\n",
       "             <li>Firewall Troubleshooting</li>\n",
       "             <li>Discounts</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning the practical application and optimization of BigQuery for large dataset analysis, including data loading, query performance, visualization, and machine learning integration.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">5</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>What is BigQuery, and how can I use it to analyze large datasets?</li>\n",
       "             <li>I have a large dataset that I want to analyze using BigQuery. How can I load my data into BigQuery?</li>\n",
       "             <li>My BigQuery queries are taking a long time to run. How can I optimize my queries for better performance?</li>\n",
       "             <li>I want to visualize my data in BigQuery using Data Studio. How can I connect Data Studio to my BigQuery dataset?</li>\n",
       "             <li>How can I use machine learning with BigQuery to gain insights from my data?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>BigQuery</li>\n",
       "             <li>large datasets</li>\n",
       "             <li>data analysis</li>\n",
       "             <li>data loading</li>\n",
       "             <li>query optimization</li>\n",
       "             <li>performance</li>\n",
       "             <li>visualization</li>\n",
       "             <li>Data Studio</li>\n",
       "             <li>machine learning</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning the practical application of Google Cloud&#x27;s Machine Learning and Data Processing tools for building, deploying, and monitoring models.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">12</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>How can I use Google Cloud to build a data pipeline?</li>\n",
       "             <li>How can I use Google Cloud to visualize my data?</li>\n",
       "             <li>How can I use AutoML to build a machine learning model without writing any code?</li>\n",
       "             <li>I need to monitor the performance of my deployed machine learning model. What tools are available on Google Cloud?</li>\n",
       "             <li>How can I use Google Cloud to build a machine learning model?</li>\n",
       "             <li>How can I use Google Cloud to deploy my machine learning model?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>Data Processing on Google Cloud</li>\n",
       "             <li>Data Pipelines</li>\n",
       "             <li>Data Visualization</li>\n",
       "             <li>Real-time Data Processing</li>\n",
       "             <li>Pre-trained Models for Image Recognition</li>\n",
       "             <li>AutoML</li>\n",
       "             <li>Machine Learning Model Deployment</li>\n",
       "             <li>Machine Learning Model Monitoring</li>\n",
       "             <li>AI and ML Services on Google Cloud</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning Google Cloud database services, particularly storage, migration, scaling, and performance optimization for Cloud SQL and Cloud Spanner.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">11</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>What database services are available on Google Cloud?</li>\n",
       "             <li>What is the difference between Cloud SQL and Cloud Spanner?</li>\n",
       "             <li>How do I migrate my existing database to Google Cloud?</li>\n",
       "             <li>How can I scale my database on Google Cloud?</li>\n",
       "             <li>My Cloud SQL database is running out of storage space. How can I increase the storage capacity?</li>\n",
       "             <li>I need to replicate my Cloud SQL database to another region for disaster recovery. How can I set up database replication?</li>\n",
       "             <li>I&#x27;m experiencing slow query performance on my Cloud Spanner database. How can I optimize my database and queries?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>Google Cloud Storage</li>\n",
       "             <li>Database Services</li>\n",
       "             <li>Cloud SQL</li>\n",
       "             <li>Cloud Spanner</li>\n",
       "             <li>Database Migration</li>\n",
       "             <li>Database Scaling</li>\n",
       "             <li>Storage Capacity</li>\n",
       "             <li>Database Replication</li>\n",
       "             <li>Query Performance</li>\n",
       "             <li>Database Security</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning the monitoring, troubleshooting, and optimization of application performance on Google Cloud Platform, particularly focusing on diagnosing latency, utilizing logging and monitoring tools, and understanding specific services like Compute Engine and Cloud Run.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">8</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>My application is experiencing performance issues. How can I troubleshoot and optimize my Compute Engine instance?</li>\n",
       "             <li>My application is experiencing high latency. Could it be a networking issue? How can I diagnose and resolve network latency problems?</li>\n",
       "             <li>I want to monitor the performance of my applications running on Google Kubernetes Engine. What tools can I use?</li>\n",
       "             <li>How can I monitor the performance and logs of my Cloud Run services?</li>\n",
       "             <li>What is Cloud Monitoring, and how does it work?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>application performance</li>\n",
       "             <li>troubleshooting</li>\n",
       "             <li>optimization</li>\n",
       "             <li>Compute Engine</li>\n",
       "             <li>network latency</li>\n",
       "             <li>Google Kubernetes Engine</li>\n",
       "             <li>Cloud Logging</li>\n",
       "             <li>Cloud Monitoring</li>\n",
       "             <li>Cloud Run</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning cost optimization strategies and troubleshooting unexpected expenses within Google Cloud Platform.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">13</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-error\">Sentiment.NEGATIVE</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>What are preemptible instances, and how can they save me money?</li>\n",
       "             <li>I&#x27;m getting billed for a Compute Engine instance that I&#x27;m not using. How can I identify and shut down unused instances?</li>\n",
       "             <li>My Cloud Storage costs are higher than expected. How can I analyze my usage and optimize my storage costs?</li>\n",
       "             <li>My Google Cloud bill is higher than expected this month. How can I identify the source of the increased cost?</li>\n",
       "             <li>I want to track the cost of my Google Cloud resources by department. How can I set up cost allocation?</li>\n",
       "             <li>What are some best practices for optimizing my Google Cloud costs?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>Preemptible Instances</li>\n",
       "             <li>Cost Optimization</li>\n",
       "             <li>Unused Resources</li>\n",
       "             <li>Cloud Storage Costs</li>\n",
       "             <li>High CPU Usage</li>\n",
       "             <li>Cost Allocation</li>\n",
       "             <li>Budgeting</li>\n",
       "             <li>Cost Analysis</li>\n",
       "             <li>Resource Management</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning the automation of application deployment, management, and scaling on Google Cloud, particularly focusing on serverless technologies like Cloud Functions and Cloud Run.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">12</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>I need to automate the deployment of my applications on Google Cloud. What tools and services can I use?</li>\n",
       "             <li>What tools are available for managing my Google Cloud resources?</li>\n",
       "             <li>How can I automate tasks on Google Cloud?</li>\n",
       "             <li>What is serverless computing, and what are its benefits?</li>\n",
       "             <li>What serverless platforms are available on Google Cloud?</li>\n",
       "             <li>How can I build and deploy a serverless application on Google Cloud?</li>\n",
       "             <li>What is Cloud Functions, and how does it work?</li>\n",
       "             <li>How can I use Google Cloud Run to deploy containerized applications?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>Cloud Load Balancing</li>\n",
       "             <li>application deployment automation</li>\n",
       "             <li>Google Cloud resource management</li>\n",
       "             <li>task automation on Google Cloud</li>\n",
       "             <li>serverless computing</li>\n",
       "             <li>serverless platforms on Google Cloud</li>\n",
       "             <li>serverless application development and deployment</li>\n",
       "             <li>Cloud Functions</li>\n",
       "             <li>Google Cloud Run</li>\n",
       "             <li>containerized applications</li>\n",
       "             <li>API development with Cloud Functions</li>\n",
       "             <li>Cloud Function timeout limits</li>\n",
       "             <li>Cloud Run deployment configuration</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning the practical aspects of using Google Cloud Storage, such as data transfer methods, file recovery, access management, and pricing.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">6</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>I need to transfer a large amount of data to Google Cloud Storage. What&#x27;s the most efficient way to do this?</li>\n",
       "             <li>I accidentally deleted some files from my Cloud Storage bucket. How can I recover them?</li>\n",
       "             <li>I want to make my data in Cloud Storage available to the public. How can I configure public access?</li>\n",
       "             <li>How much does it cost to store data in Google Cloud Storage?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>data transfer</li>\n",
       "             <li>file recovery</li>\n",
       "             <li>public access</li>\n",
       "             <li>data upload</li>\n",
       "             <li>data access</li>\n",
       "             <li>pricing</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning practical challenges and usage of Vertex AI for machine learning tasks.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">4</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>I&#x27;m trying to train a machine learning model on Vertex AI, but I&#x27;m getting errors. How can I troubleshoot these errors?</li>\n",
       "             <li>I want to deploy my trained machine learning model as an API. How can I do this using Vertex AI?</li>\n",
       "             <li>What is Vertex AI, and how can I use it?</li>\n",
       "             <li>I&#x27;m having trouble configuring IAM to use Vertex AI. What do I do?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>Vertex AI</li>\n",
       "             <li>Machine Learning</li>\n",
       "             <li>Model Training</li>\n",
       "             <li>Error Troubleshooting</li>\n",
       "             <li>Model Deployment</li>\n",
       "             <li>API</li>\n",
       "             <li>IAM Configuration</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "       <div class=\"card bg-base-100 shadow-xl\">\n",
       "         <div class=\"card-body\">\n",
       "           <h2 class=\"card-title\">Questions concerning securing Google Cloud resources and infrastructure, particularly focusing on networking, access control, and data protection.</h2>\n",
       "           <div class=\"flex flex-row items-center\">\n",
       "             <div class=\"stat\">\n",
       "               <div class=\"stat-title\">Total Questions</div>\n",
       "               <div class=\"stat-value\">20</div>\n",
       "             </div>\n",
       "             <div class=\"badge badge-neutral\">Sentiment.NEUTRAL</div>\n",
       "           </div>\n",
       "           <h4 class=\"font-bold\">Representative Questions:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>What are the security features of Google Cloud Storage?</li>\n",
       "             <li>How do I create a Virtual Private Cloud (VPC) on Google Cloud?</li>\n",
       "             <li>What are firewalls, and how do I configure them in Google Cloud?</li>\n",
       "             <li>How can I connect my on-premises network to Google Cloud?</li>\n",
       "             <li>How can I secure my applications and data on Google Cloud?</li>\n",
       "             <li>What is Identity and Access Management (IAM), and how does it work?</li>\n",
       "             <li>How can I implement multi-factor authentication on Google Cloud?</li>\n",
       "             <li>What are security best practices for Google Cloud?</li>\n",
       "           </ul>\n",
       "           <h4 class=\"font-bold\">Topics Discussed:</h4>\n",
       "           <ul class=\"list-disc list-inside mt-2\">\n",
       "             <li>Cloud Storage Security</li>\n",
       "             <li>Virtual Private Cloud (VPC)</li>\n",
       "             <li>Firewall Rules</li>\n",
       "             <li>Network Connectivity</li>\n",
       "             <li>Network Security</li>\n",
       "             <li>Data Security</li>\n",
       "             <li>Identity and Access Management (IAM)</li>\n",
       "             <li>Multi-Factor Authentication</li>\n",
       "             <li>Security Best Practices</li>\n",
       "             <li>Security Monitoring</li>\n",
       "             <li>Vulnerability Remediation</li>\n",
       "             <li>Secure Application Development</li>\n",
       "           </ul>\n",
       "         </div>\n",
       "       </div>\n",
       "     </div>\n",
       "   </body>\n",
       " </html>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from starlette.testclient import TestClient\n",
    "client = TestClient(app)\n",
    "r = client.get(\"/\")\n",
    "show(r.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "_azGz91Do81r",
   "metadata": {
    "id": "_azGz91Do81r"
   },
   "source": [
    "## Sample Questions from Each Cluster to Create the Eval Dataset\n",
    "There are two ways to draw questions from the clusters:\n",
    "- Sample randomly, in proportion to each cluster's size\n",
    "- Take samples from the most representative questions Gemini identified\n",
    "\n",
    "In practice, you will probably want to review both sets with a subject-matter expert (SME) and compare them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "vPMNsdMuBWq9",
   "metadata": {
    "collapsed": true,
    "id": "vPMNsdMuBWq9",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Calculate the total number of questions\n",
    "total_questions = df_grouped_by_cluster['question_len'].sum()\n",
    "\n",
    "# Calculate the fraction of questions for each row\n",
    "df_grouped_by_cluster['cluster_fraction'] = df_grouped_by_cluster['question_len'] / total_questions\n",
    "\n",
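    "# Sample from a cluster's question list in proportion to its share of all questions.\n",
    "# int() truncates, so the per-cluster counts can sum to slightly fewer than num_samples.\n",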
    "def sample_questions(row, num_samples):\n",
    "    return np.random.choice(row['questions_list'],\n",
    "                            size=int(num_samples * row['cluster_fraction']),\n",
    "                            replace=False).tolist()\n",
    "\n",
    "# Specify the total number of samples you want\n",
    "total_samples = 50\n",
    "\n",
    "# Apply the sampling function to each row\n",
    "df_grouped_by_cluster['proportional_sampled_questions'] = df_grouped_by_cluster.apply(lambda row: sample_questions(row, total_samples), axis=1)\n",
    "\n",
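    "# Move cluster_idx from the index back into a regular column\n",
    "df_grouped_by_cluster = df_grouped_by_cluster.reset_index()\n",
    "\n",
    "# Build one row per cluster holding two parallel lists:\n",
    "# the cluster title repeated once per sampled question, and the sampled questions\n",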
    "unrolled_proportional_df = df_grouped_by_cluster.apply(lambda x: pd.Series({\n",
    "    'cluster_title': [x[\"just_summary\"]] * len(x['proportional_sampled_questions']),\n",
    "    'sampled_question': x['proportional_sampled_questions']\n",
    "}), axis=1)\n",
    "\n",
    "# Explode both list columns so each sampled question gets its own row, then reset the index\n",
    "unrolled_proportional_df = pd.concat([unrolled_proportional_df['cluster_title'].explode(),\n",
    "                         unrolled_proportional_df['sampled_question'].explode()],\n",
    "                        axis=1).reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "nI5MjuDiHZLa",
   "metadata": {
    "collapsed": true,
    "id": "nI5MjuDiHZLa",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cluster_title</th>\n",
       "      <th>sampled_question</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>\"What are the different machine types availabl...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>\"I accidentally deleted my Compute Engine inst...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>\"Are there any discounts or sustained use disc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>\"I'm having trouble connecting to my Virtual M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"How can I use machine learning with BigQuery ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"What is BigQuery, and how can I use it to ana...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"What is Dataflow, and how does it work?\"</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"What are the different tools available for da...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"How can I use Google Cloud to build a machine...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"I need to monitor the performance of my deplo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"How can I use AutoML to build a machine learn...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>\"How can I use Google Cloud to visualize my da...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>\"What are the different storage options availa...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>\"What database services are available on Googl...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>\"I need to increase the storage space on my Co...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>\"How can I scale my database on Google Cloud?\"</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>\"How do I migrate my existing database to Goog...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>\"How can I monitor the performance and logs of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>\"My application is experiencing performance is...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>\"What is Cloud Logging, and how can I use it t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>\"I'm trying to troubleshoot an issue with my a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>\"What are some best practices for optimizing m...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>\"My Cloud Storage costs are higher than expect...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>\"I'm not using some of my Google Cloud resourc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>\"How can I track and manage my Google Cloud co...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>\"What tools are available for cost management ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>\"My Google Cloud bill is higher than expected ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>\"My Cloud Function is timing out. How can I in...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>\"What is serverless computing, and what are it...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>\"How can I use Google Cloud Run to deploy cont...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>\"I need to deploy a containerized application ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>\"What is Cloud Functions, and how does it work?\"</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>\"I want to build a simple API using Cloud Func...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Questions concerning the practical aspects of ...</td>\n",
       "      <td>\"How can I upload data to Google Cloud Storage?\"</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Questions concerning the practical aspects of ...</td>\n",
       "      <td>\"I want to make my data in Cloud Storage avail...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>Questions concerning the practical aspects of ...</td>\n",
       "      <td>\"I need to transfer a large amount of data to ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>Questions concerning practical challenges and ...</td>\n",
       "      <td>\"I'm trying to train a machine learning model ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>Questions concerning practical challenges and ...</td>\n",
       "      <td>\"I'm having trouble configuring IAM to use Ver...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"How can I improve the security of my network ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"What is Identity and Access Management (IAM),...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"I want to ensure that only authorized users c...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"I'm concerned about the security of my sensit...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"What are firewalls, and how do I configure th...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"How can I connect my on-premises network to G...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"What are the security features of Google Clou...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"I want to connect my Cloud Function to a Clou...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"I want to ensure that my network traffic is s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>\"What are security best practices for Google C...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                        cluster_title  \\\n",
       "0   Questions concerning the practical usage and t...   \n",
       "1   Questions concerning the practical usage and t...   \n",
       "2   Questions concerning the practical usage and t...   \n",
       "3   Questions concerning the practical usage and t...   \n",
       "4   Questions concerning the practical application...   \n",
       "5   Questions concerning the practical application...   \n",
       "6   Questions concerning the practical application...   \n",
       "7   Questions concerning the practical application...   \n",
       "8   Questions concerning the practical application...   \n",
       "9   Questions concerning the practical application...   \n",
       "10  Questions concerning the practical application...   \n",
       "11  Questions concerning the practical application...   \n",
       "12  Questions concerning Google Cloud database ser...   \n",
       "13  Questions concerning Google Cloud database ser...   \n",
       "14  Questions concerning Google Cloud database ser...   \n",
       "15  Questions concerning Google Cloud database ser...   \n",
       "16  Questions concerning Google Cloud database ser...   \n",
       "17  Questions concerning the monitoring, troublesh...   \n",
       "18  Questions concerning the monitoring, troublesh...   \n",
       "19  Questions concerning the monitoring, troublesh...   \n",
       "20  Questions concerning the monitoring, troublesh...   \n",
       "21  Questions concerning cost optimization strateg...   \n",
       "22  Questions concerning cost optimization strateg...   \n",
       "23  Questions concerning cost optimization strateg...   \n",
       "24  Questions concerning cost optimization strateg...   \n",
       "25  Questions concerning cost optimization strateg...   \n",
       "26  Questions concerning cost optimization strateg...   \n",
       "27  Questions concerning the automation of applica...   \n",
       "28  Questions concerning the automation of applica...   \n",
       "29  Questions concerning the automation of applica...   \n",
       "30  Questions concerning the automation of applica...   \n",
       "31  Questions concerning the automation of applica...   \n",
       "32  Questions concerning the automation of applica...   \n",
       "33  Questions concerning the practical aspects of ...   \n",
       "34  Questions concerning the practical aspects of ...   \n",
       "35  Questions concerning the practical aspects of ...   \n",
       "36  Questions concerning practical challenges and ...   \n",
       "37  Questions concerning practical challenges and ...   \n",
       "38  Questions concerning securing Google Cloud res...   \n",
       "39  Questions concerning securing Google Cloud res...   \n",
       "40  Questions concerning securing Google Cloud res...   \n",
       "41  Questions concerning securing Google Cloud res...   \n",
       "42  Questions concerning securing Google Cloud res...   \n",
       "43  Questions concerning securing Google Cloud res...   \n",
       "44  Questions concerning securing Google Cloud res...   \n",
       "45  Questions concerning securing Google Cloud res...   \n",
       "46  Questions concerning securing Google Cloud res...   \n",
       "47  Questions concerning securing Google Cloud res...   \n",
       "\n",
       "                                     sampled_question  \n",
       "0   \"What are the different machine types availabl...  \n",
       "1   \"I accidentally deleted my Compute Engine inst...  \n",
       "2   \"Are there any discounts or sustained use disc...  \n",
       "3   \"I'm having trouble connecting to my Virtual M...  \n",
       "4   \"How can I use machine learning with BigQuery ...  \n",
       "5   \"What is BigQuery, and how can I use it to ana...  \n",
       "6           \"What is Dataflow, and how does it work?\"  \n",
       "7   \"What are the different tools available for da...  \n",
       "8   \"How can I use Google Cloud to build a machine...  \n",
       "9   \"I need to monitor the performance of my deplo...  \n",
       "10  \"How can I use AutoML to build a machine learn...  \n",
       "11  \"How can I use Google Cloud to visualize my da...  \n",
       "12  \"What are the different storage options availa...  \n",
       "13  \"What database services are available on Googl...  \n",
       "14  \"I need to increase the storage space on my Co...  \n",
       "15     \"How can I scale my database on Google Cloud?\"  \n",
       "16  \"How do I migrate my existing database to Goog...  \n",
       "17  \"How can I monitor the performance and logs of...  \n",
       "18  \"My application is experiencing performance is...  \n",
       "19  \"What is Cloud Logging, and how can I use it t...  \n",
       "20  \"I'm trying to troubleshoot an issue with my a...  \n",
       "21  \"What are some best practices for optimizing m...  \n",
       "22  \"My Cloud Storage costs are higher than expect...  \n",
       "23  \"I'm not using some of my Google Cloud resourc...  \n",
       "24  \"How can I track and manage my Google Cloud co...  \n",
       "25  \"What tools are available for cost management ...  \n",
       "26  \"My Google Cloud bill is higher than expected ...  \n",
       "27  \"My Cloud Function is timing out. How can I in...  \n",
       "28  \"What is serverless computing, and what are it...  \n",
       "29  \"How can I use Google Cloud Run to deploy cont...  \n",
       "30  \"I need to deploy a containerized application ...  \n",
       "31   \"What is Cloud Functions, and how does it work?\"  \n",
       "32  \"I want to build a simple API using Cloud Func...  \n",
       "33   \"How can I upload data to Google Cloud Storage?\"  \n",
       "34  \"I want to make my data in Cloud Storage avail...  \n",
       "35  \"I need to transfer a large amount of data to ...  \n",
       "36  \"I'm trying to train a machine learning model ...  \n",
       "37  \"I'm having trouble configuring IAM to use Ver...  \n",
       "38  \"How can I improve the security of my network ...  \n",
       "39  \"What is Identity and Access Management (IAM),...  \n",
       "40  \"I want to ensure that only authorized users c...  \n",
       "41  \"I'm concerned about the security of my sensit...  \n",
       "42  \"What are firewalls, and how do I configure th...  \n",
       "43  \"How can I connect my on-premises network to G...  \n",
       "44  \"What are the security features of Google Clou...  \n",
       "45  \"I want to connect my Cloud Function to a Clou...  \n",
       "46  \"I want to ensure that my network traffic is s...  \n",
       "47  \"What are security best practices for Google C...  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unrolled_proportional_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "GhK3POqVH1mX",
   "metadata": {
    "collapsed": true,
    "id": "GhK3POqVH1mX",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
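    "# Pull Gemini's representative questions (and how many there are) out of each cluster summary\n",
    "df_grouped_by_cluster[\"gemini_representative_questions_len\"] = df_grouped_by_cluster[\"cluster_summary\"].apply(lambda x: len(x.most_representative_qs))\n",
    "df_grouped_by_cluster[\"gemini_representative_questions\"] = df_grouped_by_cluster[\"cluster_summary\"].apply(lambda x: x.most_representative_qs)\n",
    "# For each cluster, repeat its title once per representative question so both columns hold equal-length lists\n",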
    "unrolled_gemini_df = df_grouped_by_cluster.apply(lambda x: pd.Series({\n",
    "    'cluster_title': [x[\"just_summary\"]] * len(x['gemini_representative_questions']),\n",
    "    'representative_question': x['gemini_representative_questions']\n",
    "}), axis=1)\n",
    "\n",
    "# Explode both list columns so each row holds one (cluster_title, representative_question) pair, then reset the index\n",
    "unrolled_gemini_df = pd.concat([unrolled_gemini_df['cluster_title'].explode(),\n",
    "                         unrolled_gemini_df['representative_question'].explode()],\n",
    "                        axis=1).reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "8e2Wxw8XJtWt",
   "metadata": {
    "id": "8e2Wxw8XJtWt"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cluster_title</th>\n",
       "      <th>representative_question</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>How can I create a virtual machine instance on...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>What are the different machine types available...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>Can you explain the different pricing options ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>How do I connect to my Compute Engine instance...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>I accidentally deleted my Compute Engine insta...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>I want to set up a cluster of Compute Engine i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Questions concerning the practical usage and t...</td>\n",
       "      <td>I'm having trouble connecting to my Virtual Ma...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>What is BigQuery, and how can I use it to anal...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>I have a large dataset that I want to analyze ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>My BigQuery queries are taking a long time to ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>I want to visualize my data in BigQuery using ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>How can I use machine learning with BigQuery t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>How can I use Google Cloud to build a data pip...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>How can I use Google Cloud to visualize my data?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>How can I use AutoML to build a machine learni...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>I need to monitor the performance of my deploy...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>How can I use Google Cloud to build a machine ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Questions concerning the practical application...</td>\n",
       "      <td>How can I use Google Cloud to deploy my machin...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>What database services are available on Google...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>What is the difference between Cloud SQL and C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>How do I migrate my existing database to Googl...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>How can I scale my database on Google Cloud?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>My Cloud SQL database is running out of storag...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>I need to replicate my Cloud SQL database to a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Questions concerning Google Cloud database ser...</td>\n",
       "      <td>I'm experiencing slow query performance on my ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>My application is experiencing performance iss...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>My application is experiencing high latency. C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>I want to monitor the performance of my applic...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>How can I monitor the performance and logs of ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Questions concerning the monitoring, troublesh...</td>\n",
       "      <td>What is Cloud Monitoring, and how does it work?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>What are preemptible instances, and how can th...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>I'm getting billed for a Compute Engine instan...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>My Cloud Storage costs are higher than expecte...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>My Google Cloud bill is higher than expected t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>I want to track the cost of my Google Cloud re...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>Questions concerning cost optimization strateg...</td>\n",
       "      <td>What are some best practices for optimizing my...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>I need to automate the deployment of my applic...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>What tools are available for managing my Googl...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>How can I automate tasks on Google Cloud?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>What is serverless computing, and what are its...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>What serverless platforms are available on Goo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>How can I build and deploy a serverless applic...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>What is Cloud Functions, and how does it work?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>Questions concerning the automation of applica...</td>\n",
       "      <td>How can I use Google Cloud Run to deploy conta...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>Questions concerning the practical aspects of ...</td>\n",
       "      <td>I need to transfer a large amount of data to G...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>Questions concerning the practical aspects of ...</td>\n",
       "      <td>I accidentally deleted some files from my Clou...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>Questions concerning the practical aspects of ...</td>\n",
       "      <td>I want to make my data in Cloud Storage availa...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>Questions concerning the practical aspects of ...</td>\n",
       "      <td>How much does it cost to store data in Google ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>Questions concerning practical challenges and ...</td>\n",
       "      <td>I'm trying to train a machine learning model o...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>Questions concerning practical challenges and ...</td>\n",
       "      <td>I want to deploy my trained machine learning m...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50</th>\n",
       "      <td>Questions concerning practical challenges and ...</td>\n",
       "      <td>What is Vertex AI, and how can I use it?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>51</th>\n",
       "      <td>Questions concerning practical challenges and ...</td>\n",
       "      <td>I'm having trouble configuring IAM to use Vert...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>What are the security features of Google Cloud...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>How do I create a Virtual Private Cloud (VPC) ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>What are firewalls, and how do I configure the...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>55</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>How can I connect my on-premises network to Go...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>How can I secure my applications and data on G...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>57</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>What is Identity and Access Management (IAM), ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>How can I implement multi-factor authenticatio...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>Questions concerning securing Google Cloud res...</td>\n",
       "      <td>What are security best practices for Google Cl...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                        cluster_title  \\\n",
       "0   Questions concerning the practical usage and t...   \n",
       "1   Questions concerning the practical usage and t...   \n",
       "2   Questions concerning the practical usage and t...   \n",
       "3   Questions concerning the practical usage and t...   \n",
       "4   Questions concerning the practical usage and t...   \n",
       "5   Questions concerning the practical usage and t...   \n",
       "6   Questions concerning the practical usage and t...   \n",
       "7   Questions concerning the practical application...   \n",
       "8   Questions concerning the practical application...   \n",
       "9   Questions concerning the practical application...   \n",
       "10  Questions concerning the practical application...   \n",
       "11  Questions concerning the practical application...   \n",
       "12  Questions concerning the practical application...   \n",
       "13  Questions concerning the practical application...   \n",
       "14  Questions concerning the practical application...   \n",
       "15  Questions concerning the practical application...   \n",
       "16  Questions concerning the practical application...   \n",
       "17  Questions concerning the practical application...   \n",
       "18  Questions concerning Google Cloud database ser...   \n",
       "19  Questions concerning Google Cloud database ser...   \n",
       "20  Questions concerning Google Cloud database ser...   \n",
       "21  Questions concerning Google Cloud database ser...   \n",
       "22  Questions concerning Google Cloud database ser...   \n",
       "23  Questions concerning Google Cloud database ser...   \n",
       "24  Questions concerning Google Cloud database ser...   \n",
       "25  Questions concerning the monitoring, troublesh...   \n",
       "26  Questions concerning the monitoring, troublesh...   \n",
       "27  Questions concerning the monitoring, troublesh...   \n",
       "28  Questions concerning the monitoring, troublesh...   \n",
       "29  Questions concerning the monitoring, troublesh...   \n",
       "30  Questions concerning cost optimization strateg...   \n",
       "31  Questions concerning cost optimization strateg...   \n",
       "32  Questions concerning cost optimization strateg...   \n",
       "33  Questions concerning cost optimization strateg...   \n",
       "34  Questions concerning cost optimization strateg...   \n",
       "35  Questions concerning cost optimization strateg...   \n",
       "36  Questions concerning the automation of applica...   \n",
       "37  Questions concerning the automation of applica...   \n",
       "38  Questions concerning the automation of applica...   \n",
       "39  Questions concerning the automation of applica...   \n",
       "40  Questions concerning the automation of applica...   \n",
       "41  Questions concerning the automation of applica...   \n",
       "42  Questions concerning the automation of applica...   \n",
       "43  Questions concerning the automation of applica...   \n",
       "44  Questions concerning the practical aspects of ...   \n",
       "45  Questions concerning the practical aspects of ...   \n",
       "46  Questions concerning the practical aspects of ...   \n",
       "47  Questions concerning the practical aspects of ...   \n",
       "48  Questions concerning practical challenges and ...   \n",
       "49  Questions concerning practical challenges and ...   \n",
       "50  Questions concerning practical challenges and ...   \n",
       "51  Questions concerning practical challenges and ...   \n",
       "52  Questions concerning securing Google Cloud res...   \n",
       "53  Questions concerning securing Google Cloud res...   \n",
       "54  Questions concerning securing Google Cloud res...   \n",
       "55  Questions concerning securing Google Cloud res...   \n",
       "56  Questions concerning securing Google Cloud res...   \n",
       "57  Questions concerning securing Google Cloud res...   \n",
       "58  Questions concerning securing Google Cloud res...   \n",
       "59  Questions concerning securing Google Cloud res...   \n",
       "\n",
       "                              representative_question  \n",
       "0   How can I create a virtual machine instance on...  \n",
       "1   What are the different machine types available...  \n",
       "2   Can you explain the different pricing options ...  \n",
       "3   How do I connect to my Compute Engine instance...  \n",
       "4   I accidentally deleted my Compute Engine insta...  \n",
       "5   I want to set up a cluster of Compute Engine i...  \n",
       "6   I'm having trouble connecting to my Virtual Ma...  \n",
       "7   What is BigQuery, and how can I use it to anal...  \n",
       "8   I have a large dataset that I want to analyze ...  \n",
       "9   My BigQuery queries are taking a long time to ...  \n",
       "10  I want to visualize my data in BigQuery using ...  \n",
       "11  How can I use machine learning with BigQuery t...  \n",
       "12  How can I use Google Cloud to build a data pip...  \n",
       "13   How can I use Google Cloud to visualize my data?  \n",
       "14  How can I use AutoML to build a machine learni...  \n",
       "15  I need to monitor the performance of my deploy...  \n",
       "16  How can I use Google Cloud to build a machine ...  \n",
       "17  How can I use Google Cloud to deploy my machin...  \n",
       "18  What database services are available on Google...  \n",
       "19  What is the difference between Cloud SQL and C...  \n",
       "20  How do I migrate my existing database to Googl...  \n",
       "21       How can I scale my database on Google Cloud?  \n",
       "22  My Cloud SQL database is running out of storag...  \n",
       "23  I need to replicate my Cloud SQL database to a...  \n",
       "24  I'm experiencing slow query performance on my ...  \n",
       "25  My application is experiencing performance iss...  \n",
       "26  My application is experiencing high latency. C...  \n",
       "27  I want to monitor the performance of my applic...  \n",
       "28  How can I monitor the performance and logs of ...  \n",
       "29    What is Cloud Monitoring, and how does it work?  \n",
       "30  What are preemptible instances, and how can th...  \n",
       "31  I'm getting billed for a Compute Engine instan...  \n",
       "32  My Cloud Storage costs are higher than expecte...  \n",
       "33  My Google Cloud bill is higher than expected t...  \n",
       "34  I want to track the cost of my Google Cloud re...  \n",
       "35  What are some best practices for optimizing my...  \n",
       "36  I need to automate the deployment of my applic...  \n",
       "37  What tools are available for managing my Googl...  \n",
       "38          How can I automate tasks on Google Cloud?  \n",
       "39  What is serverless computing, and what are its...  \n",
       "40  What serverless platforms are available on Goo...  \n",
       "41  How can I build and deploy a serverless applic...  \n",
       "42     What is Cloud Functions, and how does it work?  \n",
       "43  How can I use Google Cloud Run to deploy conta...  \n",
       "44  I need to transfer a large amount of data to G...  \n",
       "45  I accidentally deleted some files from my Clou...  \n",
       "46  I want to make my data in Cloud Storage availa...  \n",
       "47  How much does it cost to store data in Google ...  \n",
       "48  I'm trying to train a machine learning model o...  \n",
       "49  I want to deploy my trained machine learning m...  \n",
       "50           What is Vertex AI, and how can I use it?  \n",
       "51  I'm having trouble configuring IAM to use Vert...  \n",
       "52  What are the security features of Google Cloud...  \n",
       "53  How do I create a Virtual Private Cloud (VPC) ...  \n",
       "54  What are firewalls, and how do I configure the...  \n",
       "55  How can I connect my on-premises network to Go...  \n",
       "56  How can I secure my applications and data on G...  \n",
       "57  What is Identity and Access Management (IAM), ...  \n",
       "58  How can I implement multi-factor authenticatio...  \n",
       "59  What are security best practices for Google Cl...  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unrolled_gemini_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90LThbX4ov-5",
   "metadata": {
    "id": "90LThbX4ov-5"
   },
   "source": [
    "### Save Results to CSV\n",
    "- We do need to obtain ground truth answers\n",
    "- But we can be confident we are putting the effort towards relevant, representative questions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "l_fF_K2Tebib",
   "metadata": {
    "id": "l_fF_K2Tebib"
   },
   "outputs": [],
   "source": [
    "unrolled_gemini_df.to_csv(\"representative_eval_questions.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80cacc08",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "With this notebook you can go from a mass of user queries from a RAG system and get immediate insights into the types of queries people are asking with useful clusters of queries described and analyzed by Gemini. This analysis can help inform decisions around how to improve the RAG system or it may highlight other issues in the business or product beyond what the chatbot can address. Finally, you can sample queries from these clusters to get a representative set of evaluation questions with which you can use to continuously evaluate the RAG system over time.\n",
    "\n",
    "As a next step will be to take this set of representative questions and obtain ground truth from users or subject matter experts and then evaluating performance using a service like [Vertex AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview)."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "UDNXBCiOcLbK",
    "Gw_Og38p4uZE"
   ],
   "name": "curate_new_evals_share.ipynb",
   "provenance": []
  },
  "environment": {
   "kernel": "conda-base-py",
   "name": "workbench-notebooks.m128",
   "type": "gcloud",
   "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m128"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "conda-base-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
